Computer software products for nucleic acid hybridization analysis

ABSTRACT

Methods and computer software products are provided for analyzing gene expression data. In one embodiment, multiple probes are used to detect a single transcript. The hybridization intensity of each probe is adjusted by dividing the intensity by the affinity of the probe. The minimal adjusted hybridization intensity may be used as the measurement of the gene expression.

RELATED APPLICATIONS

This application claims the priority of U.S. Provisional Application No. 60/252,808, filed on Nov. 22, 2000, which is incorporated herein by reference for all purposes.

FIELD OF INVENTION

This invention is related to bioinformatics and biological data analysis. Specifically, this invention provides methods, computer software products and systems for the analysis of biological data.

BACKGROUND OF THE INVENTION

Many biological functions are carried out by regulating the expression levels of various genes, either through changes in the copy number of the genetic DNA, through changes in levels of transcription (e.g., through control of initiation, provision of RNA precursors, RNA processing, etc.) of particular genes, or through changes in protein synthesis. For example, control of the cell cycle and cell differentiation, as well as diseases, are characterized by the variations in the transcription levels of a group of genes.

Recently, massive parallel gene expression monitoring methods have been developed to monitor the expression of a large number of genes using nucleic acid array technology, which was described in detail in, for example, U.S. Pat. No. 5,871,928; de Saizieu, et al., 1998, Bacterial Transcript Imaging by Hybridization of Total RNA to Oligonucleotide Arrays, NATURE BIOTECHNOLOGY, 16:45-48; Wodicka et al., 1997, Genome-wide Expression Monitoring in Saccharomyces cerevisiae, NATURE BIOTECHNOLOGY 15:1359-1367; Lockhart et al., 1996, Expression Monitoring by Hybridization to High Density Oligonucleotide Arrays, NATURE BIOTECHNOLOGY 14:1675-1680; Lander, 1999, Array of Hope, NATURE GENETICS, 21(suppl.), at 3.

Massive parallel gene expression monitoring experiments generate unprecedented amounts of information. For example, a commercially available GeneChip® array set is capable of monitoring the expression levels of approximately 6,500 murine genes and expressed sequence tags (ESTs) (Affymetrix, Inc., Santa Clara, Calif., USA). Array sets for approximately 60,000 human genes and EST clusters, 24,000 rat transcripts and EST clusters, and arrays for other organisms are also available from Affymetrix. Effective analysis of the large amount of data may lead to the development of new drugs and new diagnostic tools. Therefore, there is a great demand in the art for methods for organizing, accessing and analyzing the vast amount of information collected using massive parallel gene expression monitoring methods.

SUMMARY OF THE INVENTION

The current invention provides methods, systems and computer software products suitable for analyzing data from gene expression monitoring experiments that employ multiple probes against a single target.

Computer implemented methods for determining hybridization between a plurality of nucleic acid probes and a nucleic acid target are provided. The methods are useful for analyzing any hybridization between multiple probes and a target nucleic acid. They are particularly useful for analyzing gene expression experiments where a single transcript is determined using multiple probes.

In some embodiments, the method includes the steps of inputting a plurality of hybridization intensities, each of which reflects the hybridization between one of the plurality of the probes and the nucleic acid target; adjusting the hybridization intensities for hybridization affinities of the probes to obtain a plurality of adjusted hybridization intensities; finding the minimal adjusted hybridization intensity among the adjusted hybridization intensities; and indicating the minimal adjusted hybridization intensity as a measurement of the hybridization. The hybridization affinities of the probes may be predicted based upon the sequence of the probes. The hybridization affinities may be inputted from a database where experimentally determined hybridization affinities are stored. The adjusted hybridization intensity is calculated according to

$$\text{Adjusted hybridization intensity} = \frac{I}{\Gamma},$$

where $I$ is the hybridization intensity and $\Gamma$ is the hybridization affinity.

In another aspect of the invention, computer software products are provided for determining hybridization between nucleic acid probes and a nucleic acid target. A software product may include a computer-readable medium having computer-executable instructions for performing the method of the invention.

In some embodiments, the software products may include computer program code for inputting a plurality of hybridization intensities, each of which reflects the hybridization between one of the plurality of the probes and the nucleic acid target; computer program code for adjusting the hybridization intensities for hybridization affinities of the probes to obtain a plurality of adjusted hybridization intensities; computer program code for finding the minimal adjusted hybridization intensity among the adjusted hybridization intensities; computer program code for indicating the minimal adjusted hybridization intensity as a measurement of said hybridization; and a computer readable medium for storing the code. The hybridization affinities of the probes may be predicted based upon the sequence of said probes, and the software products may contain code for performing the prediction. Alternatively, the predicted hybridization affinities may be inputted. In some embodiments, the hybridization affinities are inputted from a database. Hybridization affinities may also be measured experimentally. In preferred embodiments, the adjusted hybridization intensity may be calculated according to

$$\text{Adjusted hybridization intensity} = \frac{I}{\Gamma},$$

where $I$ is the hybridization intensity and $\Gamma$ is the hybridization affinity.

In yet another aspect of the invention, systems for analyzing nucleic acid hybridization are provided. In some embodiments, the system may include a processor and a memory coupled to the processor, the memory storing a plurality of machine instructions that cause the processor to perform the method of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 illustrates an example of a computer system that may be utilized to execute the software of an embodiment of the invention.

FIG. 2 illustrates a system block diagram of the computer system of FIG. 1.

FIG. 3 shows one embodiment of the gene expression analysis method of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Reference will now be made in detail to the preferred embodiments of the invention. While the invention will be described in conjunction with the preferred embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the invention. All cited references, including patent and non-patent literature, are incorporated herein by reference in their entireties for all purposes.

I. Gene Expression Monitoring With High Density Oligonucleotide Probe Arrays

High density nucleic acid probe arrays, also referred to as “DNA Microarrays,” have become a method of choice for monitoring the expression of a large number of genes. As used herein, “Nucleic acids” may include any polymer or oligomer of nucleosides or nucleotides (polynucleotides or oligonucleotides), which include pyrimidine and purine bases, preferably cytosine, thymine, and uracil, and adenine and guanine, respectively. See Albert L. Lehninger, PRINCIPLES OF BIOCHEMISTRY, at 793-800 (Worth Pub. 1982) and L. Stryer, BIOCHEMISTRY, 4th Ed. (March 1995), both incorporated by reference. “Nucleic acids” may include any deoxyribonucleotide, ribonucleotide or peptide nucleic acid component, and any chemical variants thereof, such as methylated, hydroxymethylated or glucosylated forms of these bases, and the like. The polymers or oligomers may be heterogeneous or homogeneous in composition, and may be isolated from naturally-occurring sources or may be artificially or synthetically produced. In addition, the nucleic acids may be DNA or RNA, or a mixture thereof, and may exist permanently or transitionally in single-stranded or double-stranded form, including homoduplex, heteroduplex, and hybrid states.

“A target molecule” refers to a biological molecule of interest. The biological molecule of interest can be a ligand, receptor, peptide, nucleic acid (oligonucleotide or polynucleotide of RNA or DNA), or any other of the biological molecules listed in U.S. Pat. No. 5,445,934 at col. 5, line 66 to col. 7, line 51. For example, if transcripts of genes are the interest of an experiment, the target molecules would be the transcripts. Other examples include protein fragments, small molecules, etc. “Target nucleic acid” refers to a nucleic acid (often derived from a biological sample) of interest. Frequently, a target molecule is detected using one or more probes. As used herein, a “probe” is a molecule for detecting a target molecule. It can be any of the molecules in the same classes as the target referred to above. A probe may refer to a nucleic acid, such as an oligonucleotide, capable of binding to a target nucleic acid of complementary sequence through one or more types of chemical bonds, usually through complementary base pairing, usually through hydrogen bond formation. As used herein, a probe may include natural (i.e., A, G, U, C, or T) or modified bases (7-deazaguanosine, inosine, etc.). In addition, the bases in probes may be joined by a linkage other than a phosphodiester bond, so long as the bond does not interfere with hybridization. Thus, probes may be peptide nucleic acids in which the constituent bases are joined by peptide bonds rather than phosphodiester linkages. Other examples of probes include antibodies used to detect peptides or other molecules, and any ligands used to detect their binding partners. When referring to targets or probes as nucleic acids, it should be understood that these are illustrative embodiments that are not to limit the invention in any way.

In preferred embodiments, probes may be immobilized on substrates to create an array. An “array” may comprise a solid support with peptide or nucleic acid or other molecular probes attached to the support. Arrays typically comprise a plurality of different nucleic acids or peptide probes that are coupled to a surface of a substrate in different, known locations. These arrays, also described as “microarrays” or colloquially “chips,” have been generally described in the art, for example, in Fodor et al., Science, 251:767-777 (1991), which is incorporated by reference for all purposes. Methods of forming high density arrays of oligonucleotides, peptides and other polymer sequences with a minimal number of synthetic steps are disclosed in, for example, U.S. Pat. Nos. 5,143,854, 5,252,743, 5,384,261, 5,405,783, 5,424,186, 5,429,807, 5,445,943, 5,510,270, 5,677,195, 5,571,639, 6,040,138, all incorporated herein by reference for all purposes. The oligonucleotide analogue array can be synthesized on a solid substrate by a variety of methods, including, but not limited to, light-directed chemical coupling and mechanically directed coupling. See Pirrung et al., U.S. Pat. No. 5,143,854 (see also PCT Application No. WO 90/15070) and Fodor et al., PCT Publication Nos. WO 92/10092 and WO 93/09668, U.S. Pat. Nos. 5,677,195, 5,800,992 and 6,156,501, which disclose methods of forming vast arrays of peptides, oligonucleotides and other molecules using, for example, light-directed synthesis techniques. See also Fodor et al., Science, 251:767-77 (1991). These procedures for synthesis of polymer arrays are now referred to as VLSIPS™ procedures. Using the VLSIPS™ approach, one heterogeneous array of polymers is converted, through simultaneous coupling at a number of reaction sites, into a different heterogeneous array. See U.S. Pat. Nos. 5,384,261 and 5,677,195.

Methods for making and using molecular probe arrays, particularly nucleic acid probe arrays, are also disclosed in, for example, U.S. Pat. Nos. 5,143,854, 5,242,974, 5,252,743, 5,324,633, 5,384,261, 5,405,783, 5,409,810, 5,412,087, 5,424,186, 5,429,807, 5,445,934, 5,451,683, 5,482,867, 5,489,678, 5,491,074, 5,510,270, 5,527,681, 5,541,061, 5,550,215, 5,554,501, 5,556,752, 5,556,961, 5,571,639, 5,583,211, 5,593,839, 5,599,695, 5,607,832, 5,624,711, 5,677,195, 5,744,101, 5,744,305, 5,753,788, 5,770,456, 5,770,722, 5,831,070, 5,856,101, 5,885,837, 5,889,165, 5,919,523, 5,922,591, 5,925,517, 5,658,734, 6,022,963, 6,150,147, 6,147,205, 6,153,743, 6,140,044 and D430024, all of which are incorporated by reference in their entireties for all purposes.

Typically, a nucleic acid sample is labeled with a signal moiety, such as a fluorescent label. The sample is hybridized with the array under appropriate conditions. The arrays are washed or otherwise processed to remove non-hybridized sample nucleic acids. The hybridization is then evaluated by detecting the distribution of the label on the chip. The distribution of label may be detected by scanning the arrays to determine the distribution of fluorescence intensities. Typically, the hybridization of each probe is reflected by several pixel intensities. The raw intensity data may be stored in a gray scale pixel intensity file. The GATC™ Consortium has specified several file formats for storing array intensity data. The final software specification is available at www.gatcconsortium.org and is incorporated herein by reference in its entirety. The pixel intensity files are usually large. For example, a GATC™ compatible image file may be approximately 50 Mb if there are about 5000 pixels on each of the horizontal and vertical axes and if a two byte integer is used for every pixel intensity. The pixels may be grouped into cells (see the GATC™ software specification). The probes in a cell are designed to have the same sequence (i.e., each cell is a probe area). A CEL file contains the statistics of a cell, e.g., the 75th percentile and standard deviation of the intensities of pixels in a cell. The 50th, 60th, 65th, 70th, 80th or 85th percentile of pixel intensity of a cell is often used as the intensity of the cell.

Methods for signal detection and processing of intensity data are additionally disclosed in, for example, U.S. Pat. Nos. 5,547,839, 5,578,832, 5,631,734, 5,800,992, 5,856,092, 5,936,324, 5,981,956, 6,025,601, 6,090,555, 6,141,096, and 5,902,723. Methods for array based assays, computer software for data analysis and applications are additionally disclosed in, e.g., U.S. Pat. Nos. 5,527,670, 5,527,676, 5,545,531, 5,622,829, 5,631,128, 5,639,423, 5,646,039, 5,650,268, 5,654,155, 5,674,742, 5,710,000, 5,733,729, 5,795,716, 5,814,450, 5,821,328, 5,824,477, 5,834,252, 5,834,758, 5,837,832, 5,843,655, 5,856,086, 5,856,104, 5,856,174, 5,858,659, 5,861,242, 5,869,244, 5,871,928, 5,874,219, 5,902,723, 5,925,525, 5,928,905, 5,935,793, 5,945,334, 5,959,098, 5,968,730, 5,968,740, 5,974,164, 5,981,174, 5,981,185, 5,985,651, 6,013,440, 6,013,449, 6,020,135, 6,027,880, 6,027,894, 6,033,850, 6,033,860, 6,037,124, 6,040,138, 6,040,193, 6,043,080, 6,045,996, 6,050,719, 6,066,454, 6,083,697, 6,114,116, 6,114,122, 6,121,048, 6,124,102, 6,130,046, 6,132,580, 6,132,996, 6,136,269 and attorney docket numbers 3298.1 and 3309, all of which are incorporated by reference in their entireties for all purposes.
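By way of illustration only, the image file size arithmetic and the percentile-based cell intensity described above may be sketched in Python as follows. The function names, the nearest-rank percentile rule and the sample pixel values are assumptions made for this sketch; they are not taken from the GATC™ software specification.

```python
import math


def image_size_bytes(width_px, height_px, bytes_per_pixel=2):
    """Size of a raw gray scale image with a fixed number of bytes per pixel."""
    return width_px * height_px * bytes_per_pixel


def cell_intensity(pixel_intensities, percentile=75):
    """Summarize the pixels of one cell by a percentile of their intensities.

    Uses the nearest-rank rule (an assumption of this sketch): the value at
    rank ceil(percentile / 100 * n) in the sorted list of n pixels.
    """
    if not pixel_intensities:
        raise ValueError("a cell must contain at least one pixel")
    ordered = sorted(pixel_intensities)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# About 5000 pixels per axis at two bytes per pixel: 50,000,000 bytes.
print(image_size_bytes(5000, 5000))

# Hypothetical pixel values for a single cell.
print(cell_intensity([10, 20, 30, 40], percentile=75))
```

The first call reproduces the approximately 50 Mb figure given above. The percentile argument may be set to any of the values mentioned in the text (for example 50 or 85) without changing the rest of the sketch.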

Nucleic acid probe array technology, use of such arrays, analysis of array based experiments, associated computer software, composition for making the array and practical applications of the nucleic acid arrays are also disclosed, for example, in the following U.S. patent applications Ser. Nos. 07/838,607, 07/883,327, 07/978,940, 08/030,138, 08/082,937, 08/143,312, 08/327,522, 08/376,963, 08/440,742, 08/533,582, 08/643,822, 08/772,376, 09/013,596, 09/016,564, 09/019,882, 09/020,743, 09/030,028, 09/045,547, 09/060,922, 09/063,311, 09/076,575, 09/079,324, 09/086,285, 09/093,947, 09/097,675, 09/102,167, 09/102,986, 09/122,167, 09/122,169, 09/122,216, 09/122,304, 09/122,434, 09/126,645, 09/127,115, 09/132,368, 09/134,758, 09/138,958, 09/146,969, 09/148,210, 09/148,813, 09/170,847, 09/172,190, 09/174,364, 09/199,655, 09/203,677, 09/256,301, 09/285,658, 09/294,293, 09/318,775, 09/326,137, 09/326,374, 09/341,302, 09/354,935, 09/358,664, 09/373,984, 09/377,907, 09/383,986, 09/394,230, 09/396,196, 09/418,044, 09/418,946, 09/420,805, 09/428,350, 09/431,964, 09/445,734, 09/464,350, 09/475,209, 09/502,048, 09/510,643, 09/513,300, 09/516,388, 09/528,414, 09/535,142, 09/544,627, 09/620,780, 09/640,962, 09/641,081, 09/670,510, 09/685,011, and 09/693,204 and in the following Patent Cooperation Treaty (PCT) applications/publications: PCT/NL90/00081, PCT/GB91/00066, PCT/US91/08693, PCT/US91/09226, PCT/US91/09217, WO/93/10161, PCT/US92/10183, PCT/GB93/00147, PCT/US93/01152, WO/93/22680, PCT/US93/04145, PCT/US93/08015, PCT/US94/07106, PCT/US94/12305, PCT/GB95/00542, PCT/US95/07377, PCT/US95/02024, PCT/US96/05480, PCT/US96/11147, PCT/US96/14839, PCT/US96/15606, PCT/US97/01603, PCT/US97/02102, PCT/GB97/005566, PCT/US97/06535, PCT/GB97/01148, PCT/GB97/01258, PCT/US97/08319, PCT/US97/08446, PCT/US97/10365, PCT/US97/17002, PCT/US97/16738, PCT/US97/19665, PCT/US97/20313, PCT/US97/21209, PCT/US97/21782, PCT/US97/23360, PCT/US98/06414, PCT/US98/01206, PCT/GB98/00975, PCT/US98/04280, PCT/US98/04571, PCT/US98/05438, PCT/US98/05451, PCT/US98/12442, PCT/US98/12779, PCT/US98/12930, PCT/US98/13949, PCT/US98/15151, PCT/US98/15469, PCT/US98/15458, PCT/US98/15456, PCT/US98/16971, PCT/US98/16686, PCT/US99/19069, PCT/US98/18873, PCT/US98/18541, PCT/US98/19325, PCT/US98/22966, PCT/US98/26925, PCT/US98/27405 and PCT/IB99/00048, all of which are incorporated by reference in their entireties for all purposes. All the above cited patent applications and other references cited throughout this specification are incorporated herein by reference in their entireties for all purposes.

The embodiments of the invention will be described using GeneChip® high density oligonucleotide probe arrays (available from Affymetrix, Inc., Santa Clara, Calif., USA) as exemplary embodiments. One of skill in the art would appreciate that the embodiments of the invention are not limited to high density oligonucleotide probe arrays. On the contrary, the embodiments of the invention are useful for analyzing any parallel large scale biological analysis, such as those using nucleic acid probe arrays, protein arrays, etc.

Gene expression monitoring using GeneChip® high density oligonucleotide probe arrays is described in, for example, Lockhart et al., 1996, Expression Monitoring By Hybridization to High Density Oligonucleotide Arrays, Nature Biotechnology 14:1675-1680; U.S. Pat. Nos. 6,040,138 and 5,800,992, all incorporated herein by reference in their entireties for all purposes.

In the preferred embodiment, oligonucleotide probes are synthesized directly on the surface of the array using photolithography and combinatorial chemistry as disclosed in several patents previously incorporated by reference. In such embodiments, a single square-shaped feature on an array contains one type of probe. Probes are selected to be specific against the desired target. Methods for selecting probe sequences are disclosed in, for example, U.S. patent application Ser. No. 09/718,295, filed Nov. 21, 2000, U.S. patent application Ser. No. 09/721,042, filed Nov. 21, 2000, and U.S. Patent Application No. 60/252,617, filed Nov. 21, 2000, all incorporated herein by reference in their entireties for all purposes.

In a preferred embodiment, oligonucleotide probes in the high density array are selected to bind specifically to the nucleic acid target to which they are directed with minimal non-specific binding or cross-hybridization under the particular hybridization conditions utilized. Because the high density arrays of this invention can contain in excess of 1,000,000 different probes, it is possible to provide every probe of a characteristic length that binds to a particular nucleic acid sequence. Thus, for example, the high density array can contain every possible 20 mer sequence complementary to an IL-2 mRNA. There may, however, exist 20 mer subsequences that are not unique to the IL-2 mRNA. Probes directed to these subsequences are expected to cross hybridize with occurrences of their complementary sequence in other regions of the sample genome. Similarly, other probes simply may not hybridize effectively under the hybridization conditions (e.g., due to secondary structure, or interactions with the substrate or other probes). Thus, in a preferred embodiment, the probes that show such poor specificity or hybridization efficiency are identified and may not be included either in the high density array itself (e.g., during fabrication of the array) or in the post-hybridization data analysis.

Probes as short as 15, 20, 25 or 30 nucleotides are sufficient to hybridize to a subsequence of a gene, and, for most genes, there is a set of probes that performs well across a wide range of target nucleic acid concentrations. In a preferred embodiment, it is desirable to choose a preferred or “optimum” subset of probes for each gene before synthesizing the high density array.

In some preferred embodiments, the expression of a particular transcript may be detected by a plurality of probes, typically up to 5, 10, 15, 20, 30 or 40 probes. Each of the probes may target different sub-regions of the transcript. However, probes may overlap over targeted regions.

In some preferred embodiments, each target sub-region is detected using two probes: a perfect match (PM) probe and a mismatch (MM) probe. A PM probe is designed to be completely complementary to a reference or target sequence. In some other embodiments, a PM probe may be substantially complementary to the reference sequence. A MM probe is designed to be complementary to a reference sequence except for some mismatches that may significantly affect the hybridization between the probe and its target sequence. In preferred embodiments, MM probes are designed to be complementary to a reference sequence except for a homomeric base mismatch at the central (e.g., 13th in a 25 base probe) position. Mismatch probes are normally used as controls for cross-hybridization. A probe pair is usually composed of a PM probe and its corresponding MM probe. The difference between the PM and MM intensities provides an intensity difference in a probe pair.
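As a non-limiting sketch, the derivation of a MM probe from its PM probe may be illustrated in Python. The sketch assumes that a homomeric mismatch is produced by replacing the central base of the PM probe with its Watson-Crick complement, so that the probe base is identical to the opposing target base at that position. The function name and the example sequence are hypothetical.

```python
# Watson-Crick complements for DNA probe bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm_probe):
    """Return a MM probe that differs from the PM probe at the central base.

    For a 25 base probe the central base is the 13th (zero-based index 12).
    The central base is swapped for its complement (assumption of this sketch).
    """
    center = len(pm_probe) // 2
    swapped = COMPLEMENT[pm_probe[center]]
    return pm_probe[:center] + swapped + pm_probe[center + 1:]


# Hypothetical 25 base PM probe with G at the 13th position.
pm = "A" * 12 + "G" + "T" * 12
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

The two printed sequences are identical except at the 13th base, where G in the PM probe is replaced by C in the MM probe.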

II. Data Analysis Systems

In one aspect of the invention, methods, computer software products and systems are provided for computational analysis of microarray intensity data for determining the presence or absence of genes in a given biological sample. Accordingly, the present invention may take the form of data analysis systems, methods, analysis software, etc. Software written according to the present invention is to be stored in some form of computer readable medium, such as memory or CD-ROM, or transmitted over a network, and executed by a processor. For a description of basic computer systems and computer networks, see, e.g., Introduction to Computing Systems: From Bits and Gates to C and Beyond by Yale N. Patt and Sanjay J. Patel, 1st edition (Jan. 15, 2000), McGraw Hill Text, ISBN 0072376902, and Introduction to Client/Server Systems: A Practical Guide for Systems Professionals by Paul E. Renaud, 2nd edition (June 1996), John Wiley & Sons, ISBN 0471133337.

Computer software products may be written in any of various suitable programming languages, such as C, C++, C# (Microsoft®), Fortran, Perl, MatLab (MathWorks, www.mathworks.com), SAS, SPSS and Java. The computer software product may be an independent application with data input and data display modules. Alternatively, the computer software products may be classes that may be instantiated as distributed objects. The computer software products may also be component software such as Java Beans (Sun Microsystems), Enterprise Java Beans (EJB, Sun Microsystems), Microsoft® COM/DCOM (Microsoft®), etc.

FIG. 1 illustrates an example of a computer system that may be used to execute the software of an embodiment of the invention. The computer system described herein is also suitable for hosting a DBMS. FIG. 1 shows a computer system 101 that includes a display 103, screen 105, cabinet 107, keyboard 109, and mouse 111. Mouse 111 may have one or more buttons for interacting with a graphic user interface. Cabinet 107 houses a floppy drive 112, CD-ROM or DVD-ROM drive 102, system memory and a hard drive 113 (see also FIG. 2), which may be utilized to store and retrieve software programs incorporating computer code that implements the invention, data for use with the invention, and the like. Although a CD 114 is shown as an exemplary computer readable medium, other computer readable storage media including floppy disk, tape, flash memory, system memory, and hard drive may be utilized. Additionally, a data signal embodied in a carrier wave (e.g., in a network including the Internet) may be the computer readable storage medium.

FIG. 2 shows a system block diagram of computer system 101 used to execute the software of an embodiment of the invention. As in FIG. 1, computer system 101 includes monitor 201 and keyboard 209. Computer system 101 further includes subsystems such as a central processor 203 (such as a Pentium m processor from Intel), system memory 202, fixed storage 210 (e.g., hard drive), removable storage 208 (e.g., floppy or CD-ROM), display adapter 206, speakers 204, and network interface 211. Other computer systems suitable for use with the invention may include additional or fewer subsystems. For example, another computer system may include more than one processor 203 or a cache memory. Computer systems suitable for use with the invention may also be embedded in a measurement instrument.

III. Analysis of Hybridization of Probe Sets and Their Targets

The method of the invention will be explained in great detail using the above terminology associated with Affymetrix GeneChip® probe arrays. One of skill in the art would appreciate that the method of the invention is generally applicable to biological analysis using multiple probes (or other means of obtaining multiple measurements against one biological variable, such as the level of a transcript, etc.).

A typical situation for the current implementation and usage of the GeneChip® probe array expression analysis is that there are 10, 15 or 20 probe pairs for each gene and a group of experiments to be compared among each other. As is apparent to those skilled in the art, the current invention is not limited to any particular number of probe pairs. Preferably, the methods, systems and inventions are used to analyze data from experiments that employ at least two probe pairs, more preferably more than five probe pairs.

Some embodiments of the methods, systems and computer software products of the invention are based upon an algorithm for analyzing gene expression levels. The following notations are used to describe preferred embodiments. One of skill in the art would appreciate that the specific notations and mathematical equations are provided for the purpose of best describing the invention. The methods, systems and computer software products of the invention are not limited by the specific notations or equations.

In a gene expression experiment, the target transcripts are denoted as

t₁, t₂, t₃, . . .

If multiple experiments are conducted, the experiments are denoted as

E₁, E₂, E₃, . . .

Nucleic acid probe arrays (chips) are denoted as

A₁, A₂, A₃, . . .

Probes on a chip are denoted as

p₁, p₂, p₃, . . .

-   $X(p_j)$ is the x coordinate of the cell containing probe $j$
-   $Y(p_j)$ is the y coordinate of the cell containing probe $j$
-   $T$ = the set of all transcripts potentially existing in the target solution for any experiment, $T = \{t_1, t_2, t_3, \ldots\}$
-   $E_i(t_j)$ = the concentration of transcript $t_j$ in experiment $E_i$ (this will be zero for many combinations)
-   $D(p_j)$ = the transcript targeted by probe $p_j$, $D(p_j) = t_j$

A model is provided to relate the observed intensity for a particular probe (such as an oligonucleotide sequence) on a chip to the hybridization of that oligonucleotide to transcripts in the target solution. The model explicitly describes the contribution of “perfect match” hybridization and cross hybridization to the measured intensity.

For each probe $p_j$,

$$\alpha I_j - \beta = \Gamma_j C_{D(p_j)} + \chi_{j,\,T-\{D(p_j)\}} \qquad (1)$$

where

-   $C_{D(p_j)}$ = the concentration of the transcript measured by $p_j$, $C = E_i(D(p_j))$, $C(E_i)$
-   $I_j$ = the measured intensity, $I(E_i, A_i, p_j)$
-   $\alpha$ = the spatial variation correction factor, $\alpha(X(p_j), Y(p_j), E_i, A_i)$
-   $\beta$ = the uniform offset (background) correction, $\beta(E_i, A_i)$
-   $\Gamma_j$ = the hybridization affinity for probe $p_j$, $\Gamma(p_j, D(p_j))$
-   $\chi_{j,\,T-\{D(p_j)\}}$ = the cross hybridization term for probe $p_j$,

$$\chi_{j,\,T-\{D(p_j)\}} = \sum_{t_k \neq D(p_j)} E_i(t_k)\,\delta(t_k, p_j)$$

where $\delta(t_k, p_j)$ is the affinity of probe $p_j$ to transcript $t_k$.
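For illustration only, the right-hand side of equation (1) may be evaluated for a single probe as sketched below in Python. The dictionary-based data layout, the function names and all numerical values are assumptions made for the sketch; the model itself does not prescribe any particular representation.

```python
def cross_hybridization(concentrations, cross_affinities, target):
    """Sum of concentration * affinity over every transcript except the target.

    concentrations   maps transcript name -> concentration in the experiment
    cross_affinities maps transcript name -> affinity (delta) of this probe
                     to that transcript; missing entries count as zero
    target           the transcript D(p_j) that the probe is designed for
    """
    return sum(
        conc * cross_affinities.get(name, 0.0)
        for name, conc in concentrations.items()
        if name != target
    )


def modeled_adjusted_intensity(gamma, concentrations, cross_affinities, target):
    """Right-hand side of equation (1): Gamma * C + chi."""
    chi = cross_hybridization(concentrations, cross_affinities, target)
    return gamma * concentrations.get(target, 0.0) + chi


# Hypothetical experiment with three transcripts; the probe targets "t1".
concentrations = {"t1": 2.0, "t2": 3.0, "t3": 0.0}
cross_affinities = {"t2": 0.5, "t3": 4.0}

print(cross_hybridization(concentrations, cross_affinities, "t1"))
print(modeled_adjusted_intensity(2.0, concentrations, cross_affinities, "t1"))
```

In this hypothetical example transcript t3 has a large cross affinity but contributes nothing because its concentration is zero, which reflects the remark above that the concentration is zero for many combinations.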

It may be helpful to look at equation (1) without all the subscripts:

$$\alpha I - \beta = \Gamma C + \chi \qquad (2)$$

The left-hand side of equation (2) represents the measured intensity for probe $p_j$ after all uniform effects have been removed. These uniform effects do not depend on the sequence of the probe or on the sequences in the target solution. In other words, they are sequence independent effects that depend on the experiment and on the manufacturing characteristics of the chip. We will call the left-hand side of equation (1) the adjusted intensity.

The right-hand side of equation (2) describes the effect of hybridization on the adjusted intensity. The first term states that the adjusted intensity is a linear function of the target solution concentration. Specifically, it is a linear function of the concentration of the transcript in the target solution that contains the probe sequence. The second term states that the adjusted intensity also increases with the cross hybridization, that is, with the hybridization of the probe to all other transcripts in the target solution.

Cross hybridization is not a uniform, sequence independent process. It is not eliminated when the adjusted intensity is computed. It is a complex and unknown process, but it is not random and uniform. Correctly managing cross hybridization is the essential new concept in the new algorithms.

To further simplify the notation, the adjusted intensity is designated as $I'$. That is,

$$I' = \Gamma C + \chi \qquad (3)$$

Some embodiments of the algorithm assume that it is possible to predict the value of Γ (hybridization affinity for probes) based on the sequence of the probe. Methods for predicting Γ are described in, for example, U.S. patent application Ser. No. 09/718,295, filed Nov. 21, 2000 and U.S. patent application Ser. No. 09/721,042, filed Nov. 21, 2000, both incorporated herein by reference for all purposes.

In some embodiments, a physical model that is based on the thermodynamic properties of the sequence is used to predict the array-based hybridization intensities of the sequence. Hybridization propensities may be described by energetic parameters derived from the probe sequence, and variations in hybridization and chip manufacturing conditions will result in changes in these parameters that can be detected and corrected. The values of weight coefficients in the physical model may be determined by empirical data because these values are influenced by assay conditions, which include hybridization and target fragmentation, and probe synthesis conditions, which include choice of substrates, coupling efficiency, etc.

In one embodiment, a model experimental system is used to generate empirical data and a computational model is used to process these data to solve for the weight coefficients of the physical model. These solved weight coefficients are in turn placed back into the physical model, enabling it to predict the hybridization behaviors of new sequences.

Dividing equation (3) by the known quantity Γ gives

$\frac{I^{\prime}}{\Gamma} = C + \frac{\chi}{\Gamma}$

Because cross hybridization is difficult to eliminate completely, χ≧0. It follows that

$\frac{I_{1}}{\Gamma_{1}} \geq C \quad \text{and that} \quad \frac{I_{1}}{\Gamma_{1}} = C \quad \text{only if} \quad \chi = 0$

This means that if I₁/Γ₁, I₂/Γ₂, I₃/Γ₃, . . . is a collection of concentration estimates for a particular transcript, based on a collection of different probes for that transcript, then the best estimate for the concentration of that transcript is

$\min\left\{ \frac{I_{1}}{\Gamma_{1}}, \frac{I_{2}}{\Gamma_{2}}, \frac{I_{3}}{\Gamma_{3}}, \ldots \right\} \quad (4)$

This estimate assumes that the probes corresponding to a transcript all respond differently to cross hybridization, and that at least a few of them have very little cross hybridization.

The minimization in equation (4) does not require any assumptions about the stochastic behavior of cross hybridization. This provides a great advantage, since cross hybridization is not well modeled as a random process.
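The minimum-based estimate described above can be illustrated with a short sketch. The function name `estimate_concentration` and all numeric values are hypothetical and chosen only for illustration; the synthetic intensities are generated from I′ = ΓC + χ with χ ≥ 0, so that every ratio I′/Γ is at least C and the probe with no cross hybridization recovers C.

```python
from typing import Sequence


def estimate_concentration(intensities: Sequence[float],
                           affinities: Sequence[float]) -> float:
    """Return the smallest ratio I_i / Gamma_i over all probes of one transcript."""
    if len(intensities) == 0 or len(intensities) != len(affinities):
        raise ValueError("need one affinity per intensity and at least one probe")
    if any(g <= 0 for g in affinities):
        raise ValueError("affinities must be positive")
    return min(i / g for i, g in zip(intensities, affinities))


# Synthetic illustration (assumed values, not measured data).
true_c = 2.0                          # transcript concentration C
gammas = [1.0, 0.5, 2.0, 4.0]         # probe affinities Gamma
chis = [3.0, 0.0, 1.0, 8.0]           # cross hybridization; probe 2 has none
adjusted = [g * true_c + x for g, x in zip(gammas, chis)]  # I' = Gamma*C + chi

ratios = [i / g for i, g in zip(adjusted, gammas)]
estimate = estimate_concentration(adjusted, gammas)
```

In this synthetic case every ratio is greater than or equal to the true concentration, and the minimum coincides with it because one probe has χ = 0. With real data the minimum is only an upper bound on C unless some probe is free of cross hybridization.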

IV. Computer Implemented Methods, Computer Software and Systems for Multiple Probe Data Analysis

In one aspect of the invention, computer implemented methods are used to analyze nucleic acid hybridization. The methods are particularly suitable for multiple probe array based gene expression analysis. FIG. 3 shows a process for some embodiments of the invention. Intensity values for a set of probes (I₁, I₂ . . . I_(n)) are inputted (301). The probe set is designed to interrogate one transcript. In preferred embodiments, a probe set has at least 5, 10, 15 or 20 probes. The probes may be designed to be a perfect match with the target transcript. Alternatively, some of the probes may be designed as mismatch controls. The intensity values may be the measured values for the perfect match probes. Alternatively, they may be the difference between the intensities for the perfect match probes and those of the mismatch probes. One of skill in the art would appreciate that the intensity values may be adjusted or normalized for background, non-specific binding, etc.

The intensity values are adjusted using hybridization affinities of the probes. The predicted or measured hybridization affinity of the probes with their target may be pre-calculated and stored in a database (302). Hybridization affinity may be measured experimentally by hybridizing probes with their intended targets. In addition, hybridization affinity for probes may be predicted based upon the sequences of the probes. Methods, software products and systems for predicting hybridization affinity of probes are disclosed in, for example, U.S. patent applications Ser. No. ______, Attorney Docket Number 3359, filed on Nov. 21, 2000, and Ser. No. ______, Attorney Docket Number 3367, filed on Nov. 21, 2000, both incorporated herein by reference for all purposes.

One of skill in the art would appreciate that the methods, software products and systems are not limited to any particular model or method for predicting hybridization affinity of probes. Rather, the current invention may employ any suitable methods for predicting hybridization affinity. However, for illustration purposes, some preferred methods for predicting hybridization affinity are discussed below.

In this particular method, a physical model that is based on the thermodynamic properties of the sequence is used to predict the array-based hybridization intensities of the sequence. Hybridization propensities may be described by energetic parameters derived from the probe sequence, and variations in hybridization and chip manufacturing conditions will result in changes in these parameters that can be detected and corrected.

The values of weight coefficients in the physical model may be determined by empirical data because these values are influenced by assay conditions, which include hybridization and target fragmentation, and probe synthesis conditions, which include choice of substrates, coupling efficiency, etc.

Basically, a target (T) hybridizes to its complementary probe (P) to form a probe-target duplex (P•T), and the reaction is accompanied by a favorable free energy change. The magnitude of the free energy change (ΔG) determines the stability of the probe-target duplex. The duplex stability can be described by the equilibrium constant (K_(s)), which is sequence-dependent. The relationship between K_(s) and ΔG may be given by Boltzmann's equation

$K_{s} = \frac{k_{on}}{k_{off}} = e^{-\Delta G/RT} \quad (4)$

where k_(on) and k_(off) are the rate constants for association and dissociation, respectively, of the probe-target duplex, R is the gas constant and T is the absolute temperature. According to Equation 4, ΔG is a function of the sequence. The dependence of ΔG on probe sequence can be quite complicated, but relatively simple models for ΔG have yielded good results.
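As a numerical illustration of the relationship between K_(s) and ΔG, the following minimal sketch evaluates the exponential for a given free energy change. The function name, the choice of kcal/mol units and the temperature of 318.15 K are assumptions made for this example and are not specified in the text.

```python
import math

# Gas constant in kcal/(mol*K); the unit choice is an assumption of this sketch.
R_KCAL_PER_MOL_K = 1.987204e-3


def equilibrium_constant(delta_g_kcal_per_mol: float, temp_kelvin: float) -> float:
    """Return K_s = exp(-dG / (R*T)) for a duplex with free energy change dG."""
    if temp_kelvin <= 0:
        raise ValueError("absolute temperature must be positive")
    return math.exp(-delta_g_kcal_per_mol / (R_KCAL_PER_MOL_K * temp_kelvin))


# A more negative (more favorable) dG gives a larger equilibrium constant.
assumed_temp = 318.15  # hypothetical hybridization temperature in kelvin
k_weak = equilibrium_constant(-5.0, assumed_temp)
k_strong = equilibrium_constant(-10.0, assumed_temp)
```

Because the dependence is exponential, doubling the magnitude of ΔG squares the equilibrium constant at a fixed temperature, which is why modest sequence-dependent differences in ΔG can produce large differences in probe response.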

There are a number of ways to establish the relationship between the sequence and ΔG. In preferred embodiments, Nathan Hunt's simple model (See, U.S. application Ser. No. 09/721,042, filed Nov. 21, 2000, previously incorporated by reference) is used:

$\Delta G_{seq} = \sum\limits_{i = 1}^{3N} P_{i}S_{i} \quad (5)$

$\Delta G_{seq} = \sum\limits_{i = 1}^{3N} P_{i}S_{i} + C \quad (6)$

where N is the length (number of bases) of a probe and P_(i) is the value of the ith parameter, which reflects the ΔG of a base in a given sequence position relative to a reference base in the same position. In preferred embodiments, the reference base is A. In this case, the P_(i)'s will be the free energy of a base in a given position relative to base A in the same position.
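The position-dependent sum over 3N terms can be sketched as follows. The text does not define S_(i) explicitly; this example assumes that S is a 0/1 indicator vector with three entries per position (for C, G and T, in that order), with the reference base A encoded as all zeros. The function names and parameter values are hypothetical.

```python
from typing import List, Sequence

NON_REFERENCE_BASES = "CGT"  # assumed ordering; base A is the reference


def encode_probe(sequence: str) -> List[int]:
    """Return an assumed indicator vector S of length 3N for an N-base probe."""
    s: List[int] = []
    for base in sequence.upper():
        if base not in "ACGT":
            raise ValueError(f"unexpected base {base!r}")
        # Three slots per position; all zeros when the base is the reference A.
        s.extend(1 if base == b else 0 for b in NON_REFERENCE_BASES)
    return s


def delta_g_seq(sequence: str, params: Sequence[float],
                constant: float = 0.0) -> float:
    """Evaluate sum_i P_i * S_i + C for one probe sequence."""
    s = encode_probe(sequence)
    if len(params) != len(s):
        raise ValueError("need 3N parameters for a probe of N bases")
    return sum(p * x for p, x in zip(params, s)) + constant


# Illustrative parameters for a 3-base probe (9 values, not fitted data).
example_params = [-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0]
example_dg = delta_g_seq("ACG", example_params)
```

Under this encoding a probe consisting only of the reference base contributes nothing to the sum, so its ΔG equals the constant C alone.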

Based on the simple hybridization scheme described above, the hybridization intensity is proportional to the concentration of probe-target duplex, where C₀ is constant. Under equilibrium conditions, the intensity is directly related to ΔG. This relationship is also expressed in natural logarithm form, where C₁ and C₂ are constants. The relationship between intensity and probe sequence is described below:

$I = C_{0}[P \cdot T]$

$[P \cdot T] = K_{s}[P][T] = e^{-\Delta G/RT}[P][T]$

$\ln I = -\Delta G/RT + \ln\{C_{0}[P][T]\}$

$\ln I = C_{1}\sum\limits_{i = 1}^{3N} P_{i}S_{i} + C_{2}$

$\ln I = \sum\limits_{i = 1}^{3N} C_{1}P_{i}S_{i} + C_{2} = \sum\limits_{i = 1}^{3N} W_{i}S_{i} + C_{2}$

where W_(i)=C₁P_(i). The following is a linear regression model for probes of N bases in length using a training data set that contains intensity values of M probes:

$\ln(I_{1}) = W_{1}S_{11} + W_{2}S_{21} + \ldots + W_{3N}S_{3N,1}$

$\ln(I_{2}) = W_{1}S_{12} + W_{2}S_{22} + \ldots + W_{3N}S_{3N,2}$

$\ldots$

$\ln(I_{M}) = W_{1}S_{1M} + W_{2}S_{2M} + \ldots + W_{3N}S_{3N,M}$

By solving this model, the hybridization intensity contributions (relative to a reference base, such as an A) of each type of base at each position in the probe sequence may be predicted. Multiple linear regression analysis is well known in the art; see, for example, the electronic statistics book (http://www.statsofthe.com/textbook/stathome.html) and Darlington, R. B. (1990), Regression and Linear Models, New York: McGraw-Hill, both incorporated by reference for all purposes. Computer software packages, such as SAS, SPSS and MatLab 5.3, provide multiple linear regression functions. In addition, computer software code examples suitable for performing multiple linear regression analysis are provided in, for example, the Numerical Recipes (NR) books developed by Numerical Recipes Software and published by Cambridge University Press (CUP, with U.K. and U.S. web sites).

In a preferred embodiment, a set of probes of different sequences (probes 1 to M) is used as probes in experiment(s). Hybridization affinities (relative ΔG or Ln (I)) of the probes with their target are experimentally measured to obtain a training data set (see, example section infra). Multiple linear regression may be performed using the hybridization affinities as I [I₁ . . . I_(M)] to obtain a set of weight coefficients [W₁ . . . W_(3N)]. The weight coefficients are then used to predict the hybridization affinities.
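The regression step can be sketched with an ordinary least-squares fit. This is a minimal illustration, assuming NumPy is available, that S is a 0/1 indicator matrix with three columns per base position (reference base A as all zeros), and that an intercept column is added to absorb the constant C₂. The training data are synthetic and noise-free, and all names are hypothetical.

```python
from typing import Tuple

import numpy as np


def fit_weights(design: np.ndarray, log_intensity: np.ndarray) -> Tuple[np.ndarray, float]:
    """Least-squares fit of Ln(I) = S @ W + C2; returns (W, C2)."""
    ones = np.ones((design.shape[0], 1))
    augmented = np.hstack([design, ones])      # last column carries the intercept
    solution, *_ = np.linalg.lstsq(augmented, log_intensity, rcond=None)
    return solution[:-1], float(solution[-1])


def predict_log_intensity(design: np.ndarray, weights: np.ndarray,
                          intercept: float) -> np.ndarray:
    """Predict Ln(I) for new probes from fitted weights."""
    return design @ weights + intercept


# Synthetic training set: M = 200 probes of N = 4 bases (assumed sizes).
rng = np.random.default_rng(0)
n_probes, n_bases = 200, 4
bases = rng.integers(0, 4, size=(n_probes, n_bases))  # 0 = A, 1..3 = C, G, T
design = np.zeros((n_probes, 3 * n_bases))
for pos in range(n_bases):
    for k in (1, 2, 3):
        design[:, 3 * pos + (k - 1)] = (bases[:, pos] == k)

true_w = rng.normal(size=3 * n_bases)   # made-up weights used to generate data
true_c2 = 5.0
log_i = design @ true_w + true_c2       # noise-free Ln(I) values

weights, intercept = fit_weights(design, log_i)
```

With noise-free synthetic data and a full-rank design the fit recovers the generating weights. With measured intensities the fitted weights would be estimates whose quality depends on the size and diversity of the training probe set.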

Continuing the process in FIG. 3, the predicted or measured probe hybridization affinity values may be stored in a database (302) or a file. Alternatively, hybridization affinities may be predicted as requested. Adjusted hybridization intensity values may be calculated (301) as

$\frac{I}{\Gamma}$

An adjusted hybridization intensity may be calculated for each probe (or probe pair if the intensity is the difference between a perfect match probe and a mismatch probe). In some other embodiments, the adjusted hybridization intensity may not be a simple ratio. One of skill in the art would appreciate that other methods for calculating the relative or adjusted hybridization intensities may also be used.

The minimal value of the adjusted intensity values (303) may be used as a measurement of gene expression.

Computer software products for gene expression analysis are also provided. The products may contain a computer readable medium containing code for performing the method steps discussed above and in FIG. 3.

Computer systems for gene expression analysis are provided. The systems have a processor and a memory coupled to the processor, the memory storing machine instructions that cause the processor to perform a plurality of logical steps when implemented by the processor. The logical steps include the analysis steps discussed above.

Many embodiments of the invention are particularly useful for analyzing gene expression using nucleic acid probe arrays. As described above, such arrays contain a large number of sets of probes. Each set is used for measuring one transcript. In such embodiments, the process in FIG. 3 is repeated for each probe set. It is generally preferable that all the probe sets (each for measuring one transcript) in a probe array are analyzed using the methods, software or system of the invention. However, in some embodiments, a subset of the probe sets may be analyzed using the methods, systems and software of the invention.
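Repeating the process for each probe set on an array can be sketched as a loop over transcripts. In this hypothetical example the affinity database (302) is modeled as an in-memory mapping from probe identifier to Γ, and each probe set is a list of (probe identifier, intensity) pairs; these data structures, the identifiers and the numbers are assumptions for illustration only.

```python
from typing import Dict, Mapping, Sequence, Tuple

ProbeReading = Tuple[str, float]  # (probe identifier, hybridization intensity)


def expression_by_transcript(
    probe_sets: Mapping[str, Sequence[ProbeReading]],
    affinity_db: Mapping[str, float],
) -> Dict[str, float]:
    """For each transcript, return the minimal affinity-adjusted intensity."""
    results: Dict[str, float] = {}
    for transcript, readings in probe_sets.items():
        if not readings:
            raise ValueError(f"probe set for {transcript!r} is empty")
        # Adjust each intensity by the stored affinity of its probe.
        adjusted = [intensity / affinity_db[probe_id]
                    for probe_id, intensity in readings]
        results[transcript] = min(adjusted)
    return results


# Hypothetical array with two probe sets and a small affinity table.
probe_sets = {
    "geneA": [("a1", 10.0), ("a2", 6.0), ("a3", 9.0)],
    "geneB": [("b1", 4.0), ("b2", 1.0)],
}
affinity_db = {"a1": 2.0, "a2": 2.0, "a3": 1.5, "b1": 0.5, "b2": 0.25}

expression = expression_by_transcript(probe_sets, affinity_db)
```

Analyzing only a subset of the probe sets corresponds to passing a mapping that contains only the transcripts of interest.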

CONCLUSION

The present inventions provide methods and computer software products for analyzing gene expression profiles. It is to be understood that the above description is intended to be illustrative and not restrictive. Many variations of the invention will be apparent to those of skill in the art upon reviewing the above description. By way of example, the invention has been described primarily with reference to the use of a high density oligonucleotide array, but it will be readily recognized by those of skill in the art that other nucleic acid arrays, other methods of measuring transcript levels and gene expression monitoring at the protein level could be used. The scope of the invention should, therefore, be determined not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

All cited references, including patent and non-patent literature, are incorporated herewith by reference in their entireties for all purposes.

1. A computer implemented method for determining hybridization between a plurality of nucleic acid probes and a nucleic acid target comprising inputting a plurality of hybridization intensities, each of said intensities reflects said hybridization between one of said plurality of said probes and said nucleic acid target, adjusting said hybridization intensities for hybridization affinities of said probes to obtain a plurality of adjusted hybridization intensities, finding the minimal adjusted hybridization intensity among said adjusted hybridization intensities; and indicating said minimal adjusted hybridization intensity as a measurement of said hybridization.
 2. The method of claim 1 wherein said hybridization affinities of said probes are predicted based upon the sequence of said probes.
 3. The method of claim 1 wherein said hybridization affinities are inputted from a database.
 4. The method of claim 3 wherein said hybridization affinities are measured experimentally.
 5. The method of claim 1 wherein said adjusted hybridization intensity is calculated according to $\text{Adjusted hybridization intensity} = \frac{I}{\Gamma}$, wherein said I is said hybridization intensity and said Γ is said hybridization affinity.
 6. The method of claim 5 wherein said plurality of probes have at least 5 probes.
 7. The method of claim 6 wherein said plurality of probes have at least 10 probes.
 8. The method of claim 7 wherein said plurality of probes have at least 20 probes.
 9. A computer software product for determining hybridization between a plurality of nucleic acid probes and a nucleic acid target comprising computer program code for inputting a plurality of hybridization intensities, each of said hybridization intensities reflects said hybridization between one of said plurality of said probes and said nucleic acid target, computer program code for adjusting said hybridization intensities for hybridization affinities of said probes to obtain a plurality of adjusted hybridization intensities, computer program code for finding the minimal adjusted hybridization intensity among said adjusted hybridization intensities and computer program code for indicating said minimal adjusted hybridization intensity as a measurement of said hybridization, and a computer readable media for storing said code.
 10. The computer software product of claim 9 wherein said hybridization affinities of said probes are predicted based upon the sequence of said probes.
 11. The computer software product of claim 10 wherein said hybridization affinities are inputted from a database.
 12. The computer software product of claim 11 wherein said hybridization affinities are measured experimentally.
 13. The computer software product of claim 12 wherein said adjusted hybridization intensity is calculated according to $\text{Adjusted hybridization intensity} = \frac{I}{\Gamma}$, wherein said I is said hybridization intensity and said Γ is said hybridization affinity.
 14. A computer-readable medium having computer-executable instructions for performing a method comprising, inputting a plurality of hybridization intensities, each of said intensities reflects said hybridization between one of said plurality of said probes and said nucleic acid target, adjusting said hybridization intensities for hybridization affinities of said probes to obtain a plurality of adjusted hybridization intensities, finding the minimal adjusted hybridization intensity among said adjusted hybridization intensities, and indicating said minimal adjusted hybridization intensity as a measurement of said hybridization.
 15. The computer readable medium of claim 14 wherein said hybridization affinities of said probes are predicted based upon the sequence of said probes.
 16. The computer readable medium of claim 15 wherein said hybridization affinities are inputted from a database.
 17. The computer readable medium of claim 16 wherein said hybridization affinities are measured experimentally.
 18. The computer readable medium of claim 17 wherein said adjusted hybridization intensity is calculated according to $\text{Adjusted hybridization intensity} = \frac{I}{\Gamma}$, wherein said I is said hybridization intensity and said Γ is said hybridization affinity.
 19. A system for comparing nucleic acid probes, comprising a processor, and a memory being coupled to the processor, the memory storing a plurality of machine instructions that cause the processor to perform a plurality of logical steps when implemented by the processor, said logical steps including inputting a plurality of hybridization intensities, each of said intensities reflects said hybridization between one of said plurality of said probes and said nucleic acid target, adjusting said hybridization intensities for hybridization affinities of said probes to obtain a plurality of adjusted hybridization intensities, finding the minimal adjusted hybridization intensity among said adjusted hybridization intensities, and indicating said minimal adjusted hybridization intensity as a measurement of said hybridization.
 20. The system of claim 19 wherein said hybridization affinities of said probes are predicted based upon the sequence of said probes.
 21. The system of claim 20 wherein said hybridization affinities are inputted from a database.
 22. The system of claim 21 wherein said hybridization affinities are measured experimentally.
 23. The system of claim 22 wherein said adjusted hybridization intensity is calculated according to $\text{Adjusted hybridization intensity} = \frac{I}{\Gamma}$, wherein said I is said hybridization intensity and said Γ is said hybridization affinity.